library(tidyverse)
library(tidymodels)
library(naniar)
library(skimr)
library(kableExtra)

The arts in all their forms, from live jazz to operas, plays, classical symphonies, musicals, and dance performances, are a ubiquitous facet of modern life. Museums, theaters, small music venues, and symphony halls are present in cities across the country and the world. They are often integral to how we define both our residential spaces and our culture, because of the ways in which participation in and exposure to the arts bring joy, meaning, and community into our lives. A pressing subject that we, as data scientists, lovers of the arts, and American citizens, wish to investigate is the link between participation in the arts in America and various demographic and social factors, particularly income. For our predictive modeling project, we have therefore decided to utilize a data-set from the 1992 Local Area Arts Participation study. This study was funded by the National Endowment for the Arts (Research Division) and conducted by Abt Associates of Cambridge, MA. It contains information on whether and how often Americans attended and participated in various kinds of local art, including jazz concerts, ballet performances, and more, as well as various demographic characteristics of the respondents. In collecting the data, Abt Associates randomly selected 5,040 Americans from 12 localities (Broward County, Chicago, Dade County, Las Vegas, Pittsburgh, Reno, Rural Nevada, San Jose, Seattle, Sedona, Winston-Salem) via a random-digit-dialing sampling methodology and contacted them by phone for the survey.
Thus, it is our intention to utilize the various variables regarding arts participation in this data-set, as well as demographic information like race, gender, and age, to predict the household income of the respondents in this survey. Hopefully, through the process of predicting respondent income, we can better understand who attends arts performances and exhibits, how often they do, and why other people don’t in order to eventually make the arts more accessible and appealing to everyone.
The data from the Local Area Arts Participation study was downloaded from the National Archive of Data on Arts and Culture (see appendix for citation). It required some minor cleaning before it was ready for use in this project. This cleaning involved first converting the data into a tibble format. We also changed data values from NA to zero in certain variables where, given the context, this made more sense. (For example, respondents who said they had not attended a jazz concert in the last 12 months should have a corresponding value of zero, rather than NA, for the follow-up variable asking how many times they attended a jazz concert in the last 12 months. The same applies to all similarly formatted pairs of variables.) Further, in all categorical variables, we decided to convert the levels Don't know and Refused to NA. Also, for numeric variables, missing values were coded as 98 and 99, so we had to correct that in order to read the data correctly. We also made a few other minor adjustments, such as dropping a few variables that we knew from the get-go would be irrelevant to our project, re-coding the names of the levels of income (our outcome variable) for easier readability, and cleaning our variable names into snake case.
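As a sketch of this cleaning logic: the jazz/njazz pair below mirrors the survey's question pairs, but the values are made up, and the real cleaning script applies the same pattern across every such variable pair.

```r
library(tidyverse)

# Toy data mimicking the survey's coding conventions (values are illustrative)
raw <- tibble(
  jazz  = factor(c("Yes", "No", "Don't know")),  # attended a jazz concert?
  njazz = c(3, NA, 98)                           # how many times? (98/99 code missing)
)

cleaned <- raw %>%
  # numeric variables used 98 and 99 as codes for missing values
  mutate(njazz = if_else(njazz %in% c(98, 99), NA_real_, njazz)) %>%
  # levels like "Don't know" and "Refused" become NA in categorical variables
  mutate(jazz = fct_recode(jazz, NULL = "Don't know")) %>%
  # a "No" to the attendance question implies a count of zero, not NA
  mutate(njazz = if_else(jazz == "No" & is.na(njazz), 0, njazz))
```

After these steps, the "No" respondent has a count of 0, and both the 98 code and the "Don't know" level have become genuine NAs.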
After cleaning, we wrote out our processed data-set into an .rda file, which is read in below. The data is named local_arts_data. There are 5,040 total observations and 88 attributes. 65 of the characteristics are factored data while the other 23 are numeric data. The categorical characteristics generally involve whether the respondent attended or participated in local art, as well as demographic data. The numerical features include the number of times the respondent attended that specific event as well as numeric demographic data.
# Load the Data
load(file = "data/processed/local_arts_data.Rda")

After reading in the data, we performed an exploration of missingness. The results are shown below in the form of a table and a graph.
# Create function to select columns with missing values
has_na <- function(x) {
  any(is.na(x))
}
# table of missingness
local_arts_data %>%
  # select only columns that have missing values
  select_if(has_na) %>%
  # summary of missing variables (from naniar)
  miss_var_summary() %>%
  print(n = Inf)

## # A tibble: 80 x 3
## variable n_miss pct_miss
## <chr> <int> <dbl>
## 1 bar5 5029 99.8
## 2 bar4 4946 98.1
## 3 mags 4913 97.5
## 4 more8 4823 95.7
## 5 bar3 4612 91.5
## 6 more7 4385 87.0
## 7 radio 3943 78.2
## 8 more6 3738 74.2
## 9 tvtype 3642 72.3
## 10 bar2 3590 71.2
## 11 more5 2969 58.9
## 12 mostimp 2370 47.0
## 13 more4 2229 44.2
## 14 more3 1570 31.2
## 15 bar1 1483 29.4
## 16 newsp 1432 28.4
## 17 more1 1124 22.3
## 18 moremost 1124 22.3
## 19 more2 1009 20.0
## 20 over18 940 18.7
## 21 income 915 18.2
## 22 source1 361 7.16
## 23 rateinfo 322 6.39
## 24 age 226 4.48
## 25 ntvart 175 3.47
## 26 ntvjazz 143 2.84
## 27 ntvclass 137 2.72
## 28 ntvdance 133 2.64
## 29 ntvplay 128 2.54
## 30 race 118 2.34
## 31 ntvmus 116 2.30
## 32 educ 96 1.90
## 33 marital 95 1.88
## 34 nbooks 84 1.67
## 35 hhsize 73 1.45
## 36 gomore 67 1.33
## 37 npark 66 1.31
## 38 tvart 64 1.27
## 39 tvplay 56 1.11
## 40 ntvopera 53 1.05
## 41 tvmus 47 0.933
## 42 nmuseum 38 0.754
## 43 tvjazz 37 0.734
## 44 schools 34 0.675
## 45 nplay 32 0.635
## 46 park 29 0.575
## 47 tvdance 27 0.536
## 48 howimp 27 0.536
## 49 tvclass 25 0.496
## 50 nfair 22 0.437
## 51 play 21 0.417
## 52 nmusical 20 0.397
## 53 lisjazz 18 0.357
## 54 lisplay 18 0.357
## 55 nclassic 15 0.298
## 56 musical 15 0.298
## 57 njazz 14 0.278
## 58 hearpoet 13 0.258
## 59 museum 12 0.238
## 60 lismus 12 0.238
## 61 jazz 9 0.179
## 62 fair 8 0.159
## 63 tvopera 8 0.159
## 64 lisclass 8 0.159
## 65 lisopera 8 0.159
## 66 cinema 8 0.159
## 67 books 7 0.139
## 68 classic 6 0.119
## 69 nopera 5 0.0992
## 70 readplay 5 0.0992
## 71 readpoet 5 0.0992
## 72 opera 4 0.0794
## 73 nodance 4 0.0794
## 74 readnov 4 0.0794
## 75 hearnov 4 0.0794
## 76 nballet 3 0.0595
## 77 caseid 2 0.0397
## 78 odance 2 0.0397
## 79 ballet 1 0.0198
## 80 weight 1 0.0198
# graph missingness
local_arts_data %>%
  # select only columns that have missing values
  select_if(has_na) %>%
  # create missingness graph (from naniar)
  gg_miss_var()

80 of the 88 variables in our data-set contain missingness, some a considerable amount. As can be seen in the table, over 20% of the values of the variables from bar5 down to moremost are NA. That is 18 variables with more than 20% of their values missing. The majority of the variables with missingness, however, are missing less than 20% of their values. We postulated three reasons that this missingness occurs: the respondent refused to answer the question, the question did not apply to the respondent, or the respondent did not know the answer.
The majority of NA values are due to the second reason: the question did not apply to the respondent. For instance, the columns with the most missing values, bar5 and bar4, ask about the fourth and fifth reasons why the respondent did not attend cultural and artistic events more often. The previous three questions, bar1, bar2, and bar3, ask the same question, and it's likely that most respondents didn't have more than three reasons for not attending more cultural/artistic events. Thus, bar5 and bar4 did not apply to these respondents, given that their reasons were already covered by their responses to bar1, bar2, and bar3. Another example is mags, which asks what magazines the respondent read to learn about art events in the community. The response a person gives to this question depends on whether they read magazines and cited them as a source for learning about art events in the first place (i.e., their responses to the source variables, which ask where a respondent learned about arts events in their community). Thus, if people didn't learn about arts events through magazines (which clearly few people did, given how many missing values this variable has), they would not respond to this question and their value for this variable would be NA. For those predictors whose missing values do not fall under the category of the question not applying to the respondent, it is more likely that the respondent simply did not know the answer to the question than that they refused to answer it.
For the variables with more than 20% missingness, we decided it would make sense to remove most of them, since imputing more than 20% of the observations would be unreliable. We decided to remove 14 predictors with a significant amount of missingness: bar5, bar4, mags, more8, bar3, more7, radio, more6, tvtype, bar2, more5, more4, more3, newsp. The code for that is below.
local_arts_data <- local_arts_data %>%
  select(-bar5, -bar4, -mags, -more8, -bar3, -more7, -radio, -more6,
         -tvtype, -bar2, -more5, -more4, -more3, -newsp)

We decided to keep one variable with greater than 20% missingness, mostimp, since that question asks for the most important reason the participant did not attend arts events more often.
For the rest of the variables with missingness at or below 20%, an imputation step will be used to replace those missing values in our model recipes.
In our predictive modeling project, we will attempt to predict the survey respondent's household income, the variable income, using as predictors the other variables in the data-set, which relate to arts attendance and demographic information. income is a categorical variable with 9 levels indicating the range of the respondent's household income. Before conducting an initial split of the data into a training and testing set, we wanted to visualize the outcome variable to understand its distribution and check for problems.
# Distribution of income (graph)
local_arts_data %>%
  ggplot(aes(x = income)) +
  geom_bar() +
  coord_flip() +
  labs(
    title = "Distribution of Respondent's Household Income"
  ) +
  theme_minimal()

# Table of the levels of income
local_arts_data %>%
  count(income)

## # A tibble: 10 x 2
## income n
## <fct> <int>
## 1 Under $10,000 315
## 2 $10,000 to $14,999 288
## 3 $15,000 to $19,999 375
## 4 $20,000 to $29,999 721
## 5 $30,000 to $39,999 679
## 6 $40,000 to $49,999 540
## 7 $50,000 to $74,999 698
## 8 $75,000 to $99,999 243
## 9 $100,000 or more 266
## 10 <NA> 915
As the graph and table above show, there is a significant amount of missingness in the outcome variable, with a total of 915 missing values. Most of this missingness is because participants didn't know their household income or refused to answer the question, with some NA values being truly missing responses. For the rest of the data, the income categories with the highest counts are $50,000 to $74,999, $30,000 to $39,999, and $20,000 to $29,999. The categories with the lowest counts are $75,000 to $99,999 and $100,000 or more. Before splitting the data into a training and test set, it makes sense to remove the rows with missing values of income, since it would not make sense (nor would it be possible) to evaluate predictions against missing outcome values. The code for that is below.
# Only keep observations with values for income
local_arts_data <- local_arts_data %>%
  filter(!is.na(income))

We decided to initially split the data into 70% training and 30% testing. The imbalance between the various classes of income shown above makes it important to stratify our initial split (and later our cross-validation) by income. The training and testing sets are read in below. Code for how this was actually conducted can be found in the model_set_up.R script inside the "models" folder inside the "output" folder. The training data will be used to build our candidate models, and the testing data will be used as "new data" to test the final performance of our best model after tuning.
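For reference, a sketch of what that split step in model_set_up.R presumably looks like (the seed shown is illustrative, not necessarily the one actually used):

```r
library(tidymodels)

# stratified 70/30 initial split (sketch; seed value is illustrative)
set.seed(3013)
local_arts_split <- initial_split(local_arts_data, prop = 0.7, strata = income)
local_arts_train <- training(local_arts_split)
local_arts_test  <- testing(local_arts_split)
```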
# load training data
load(file = "data/processed/local_arts_train.rda")
# load testing data
load(file = "data/processed/local_arts_test.rda")

After conducting the initial split, the training data was also folded using 5-fold cross-validation with 3 repeats. These folds were used later during the tuning process to collect reliable information on model performance and avoid overfitting. Folding the training data allows us to more reliably assess the performance of our models (i.e., decrease the bias of our results) before applying them to the testing data, because we are not assessing each model on the entire training data, but rather on multiple small subsets. When we fold the training data, we split it into, in this case, 5 groups. For each fold, four groups are kept as the analysis set, used to build and train the model, and one group is left out as the assessment set, used to assess the performance of the model. A different group is left out as the assessment set each time to ensure that the model being tuned is evaluated across different subsets of the training data. Repeated cross-validation then simply means that this folding process is repeated, in this case two more times, with the data shuffled differently into the 5 groups each repeat. Repeating the folding process reduces the variance in the performance values collected from the assessment sets of the folds. The folded training data is read in below. Code for how this was actually conducted can be found in the model_set_up.R script inside the "models" folder inside the "output" folder.
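A sketch of the corresponding fold creation (again, the actual call lives in model_set_up.R, and the seed here is illustrative):

```r
library(tidymodels)

# 5-fold cross-validation with 3 repeats, stratified by the outcome
# (sketch; seed value is illustrative)
set.seed(3013)
local_arts_fold <- vfold_cv(local_arts_train, v = 5, repeats = 3, strata = income)
```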
# load folded object
load(file = "data/processed/local_arts_fold.rda")

Exploratory data analysis is a vital step in the modeling process. This is where relationships between the outcome and predictor variables are analyzed, and variables are examined for any problematic qualities that will need to be addressed in a model recipe.
After removing variables with a significant amount of missingness, there are 74 variables remaining, 73 of which may serve as predictor variables.
To begin our exploratory data analysis, we thought it would make sense to explore the relationship between income and the various variables that deal with attendance at/engagement with the arts. There are two sets of variables indicating participation in the arts. One set is categorical, indicating whether or not the participant has gone to the indicated arts event in the last 12 months. The other set is numeric, indicating how many times the participant has gone to the indicated arts event in the last 12 months. First, let's look at how income is related to some of the categorical arts variables, i.e., whether or not a person has attended the specified arts event in the last 12 months.
# combined graph of categorical arts participation variables
local_arts_train %>%
  select(income, jazz, classic, opera, musical, play, ballet, museum, fair, odance) %>%
  pivot_longer(-income, names_to = "var", values_to = "value") %>%
  ggplot(aes(x = income, fill = value)) +
  geom_bar(position = "fill") +
  coord_flip() +
  facet_wrap(~ var) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45),
        legend.position = "top") +
  guides(fill = guide_legend(title.position = "top")) +
  labs(
    fill = "Have you attended (*) in the last 12 months?",
    y = "Proportion of Participants",
    x = "Income"
  )

The faceted graph above shows that for most arts activities, the proportion of people who attend them increases as income bracket increases. This pattern is particularly defined for attendance at museums and musicals. Only around 30% of those with incomes under 10,000 dollars attended museums within the year, while around 70% of those with incomes of 100,000 dollars or more did. Only about 15% of people with incomes under 10,000 dollars attended musicals within the year, while around 50% of those with incomes of 100,000 dollars or more did. This pattern makes sense, given that arts events cost money (sometimes quite a bit) to attend. Thus, people with more financial ability and free time are more likely to attend arts events.
Another categorical arts participation variable that we thought would be important to look at in relation to income was whether or not a person visited the movies in the past twelve months.
local_arts_data %>%
  ggplot(aes(income, fill = cinema)) +
  geom_bar(position = "fill") +
  coord_flip() +
  theme_minimal() +
  labs(title = "Going to the movies by income",
       fill = "Have you gone to movie theater\nto see a movie\nin last 12 months?",
       y = "prop")

The graph pretty clearly shows that people with higher incomes went to see a movie in theaters more frequently in the last twelve months than people with lower incomes. cinema thus proves to be another important predictor variable to include alongside the other categorical predictors related to arts participation shown above.
Next, let’s look at income in relation to the variables that indicate the number of times people attended the same arts events within the year.
# combined graph of numeric arts participation variables
local_arts_train %>%
  select(income, starts_with("n")) %>%
  select(., 1:10) %>%
  pivot_longer(-income, names_to = "var", values_to = "value") %>%
  ggplot(aes(x = value, y = income)) +
  geom_boxplot() +
  facet_wrap(~ var) +
  theme_minimal() +
  coord_cartesian(xlim = c(0, 10)) +
  theme(axis.text.x = element_text(angle = 45)) +
  labs(
    title = "Frequency of Attendance at Arts Activities by Income",
    x = "number of times attending"
  )

It's much harder to see a pattern with the numeric arts variables because so many people did not attend the above types of arts events at all within the year. However, there are a few variables in which a pattern with income is clearer. Again, there seems to be a strong relationship between a person's household income and the number of times they attended museums within the year, with higher-income respondents attending museums more frequently. It also seems that people in the top few income brackets attended jazz and classical concerts, as well as musicals and plays, more frequently than those with less money.
Below is another graph of income in relation to more variables indicating the number of times the respondent participated in a specific activity in the last 12 months. nbooks refers to the number of books the respondent read; npark refers to the number of times the respondent visited a history park/monument; ntvart refers to the number of times the respondent watched a visual arts program on TV/VCR; ntvclass refers to the number of times the respondent watched a classical music program on TV/VCR; ntvdance refers to the number of times a respondent watched a dance program on TV; ntvjazz refers to the number of times a respondent watched a jazz program on TV; ntvmus refers to the number of times a respondent watched a musical on TV; ntvopera refers to the number of times a respondent watched opera on TV, and ntvplay refers to the number of times a respondent watched a play on TV.
# combined graph of numeric arts participation variables 2
local_arts_train %>%
  select(income, starts_with("n")) %>%
  select(., 1, 11:19) %>%
  gather(-income, key = "var", value = "value") %>%
  ggplot(aes(x = value, y = income)) +
  geom_boxplot() +
  facet_wrap(~ var) +
  theme_minimal() +
  coord_cartesian(xlim = c(0, 20)) +
  theme(axis.text.x = element_text(angle = 45)) +
  labs(
    title = "Frequency of Engaging in/Watching Arts Activities by Income",
    x = "number of times engaging"
  )

Again, for most of these arts activities, there doesn't seem to be a noticeable pattern between the number of times the participant engaged with the activity and their household income. nbooks, however, is an exception: it's very clear that participants with higher household incomes read more books within the year. There also seems to be a slight relationship between npark and income, with people with higher yearly incomes visiting historic parks more frequently than those with lower yearly incomes.
It would also make sense to look at the distributions of these numeric variables related to attendance at, and engagement with, arts activities. Analyzing the distribution of numeric variables in one’s data-set is important in the pre-modeling EDA process. This is because many models are very sensitive to skewness, and thus skewness in numeric data must be identified and dealt with before the data can be fed to the models. Below, we selected just these numeric arts variables, and conducted a skim with charts.
local_arts_train %>%
  select(starts_with("n")) %>%
  skim()

| Name | Piped data |
| Number of rows | 2886 |
| Number of columns | 18 |
| _______________________ | |
| Column type frequency: | |
| numeric | 18 |
| ________________________ | |
| Group variables | None |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| njazz | 8 | 1.00 | 0.76 | 4.31 | 0 | 0 | 0 | 0 | 97 | ▇▁▁▁▁ |
| nclassic | 7 | 1.00 | 0.68 | 3.80 | 0 | 0 | 0 | 0 | 97 | ▇▁▁▁▁ |
| nopera | 2 | 1.00 | 0.10 | 0.48 | 0 | 0 | 0 | 0 | 8 | ▇▁▁▁▁ |
| nmusical | 13 | 1.00 | 0.65 | 1.61 | 0 | 0 | 0 | 1 | 30 | ▇▁▁▁▁ |
| nplay | 13 | 1.00 | 0.58 | 2.02 | 0 | 0 | 0 | 0 | 50 | ▇▁▁▁▁ |
| nballet | 2 | 1.00 | 0.15 | 0.80 | 0 | 0 | 0 | 0 | 25 | ▇▁▁▁▁ |
| nodance | 4 | 1.00 | 0.35 | 2.78 | 0 | 0 | 0 | 0 | 97 | ▇▁▁▁▁ |
| nmuseum | 17 | 0.99 | 2.43 | 8.34 | 0 | 0 | 0 | 2 | 97 | ▇▁▁▁▁ |
| nfair | 11 | 1.00 | 1.65 | 4.04 | 0 | 0 | 1 | 2 | 97 | ▇▁▁▁▁ |
| npark | 34 | 0.99 | 2.72 | 9.36 | 0 | 0 | 0 | 2 | 97 | ▇▁▁▁▁ |
| nbooks | 35 | 0.99 | 14.00 | 22.83 | 0 | 1 | 5 | 15 | 97 | ▇▁▁▁▁ |
| ntvjazz | 71 | 0.98 | 1.93 | 7.02 | 0 | 0 | 0 | 2 | 97 | ▇▁▁▁▁ |
| ntvclass | 61 | 0.98 | 2.10 | 7.12 | 0 | 0 | 0 | 2 | 97 | ▇▁▁▁▁ |
| ntvopera | 21 | 0.99 | 0.60 | 2.80 | 0 | 0 | 0 | 0 | 97 | ▇▁▁▁▁ |
| ntvmus | 49 | 0.98 | 1.02 | 4.58 | 0 | 0 | 0 | 1 | 97 | ▇▁▁▁▁ |
| ntvplay | 64 | 0.98 | 1.19 | 5.43 | 0 | 0 | 0 | 1 | 97 | ▇▁▁▁▁ |
| ntvdance | 57 | 0.98 | 1.58 | 6.15 | 0 | 0 | 0 | 2 | 97 | ▇▁▁▁▁ |
| ntvart | 89 | 0.97 | 3.07 | 9.86 | 0 | 0 | 0 | 3 | 97 | ▇▁▁▁▁ |
Every variable is very clearly right skewed. Each variable has the bulk of its data around zero, a mean between zero and 15, and then data trailing off to some very high outliers. Furthermore, the small histograms shown in the skim display an asymmetric distribution for each of the selected variables. This skewness, as mentioned above, will need to be addressed in the modeling recipes.
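As a quick, self-contained illustration of the fix we use later in our recipes (on simulated skewed data, not the survey data), a Yeo-Johnson step makes a right-skewed variable far more symmetric:

```r
library(recipes)

# simulated right-skewed data (illustrative, not the survey data)
set.seed(1)
skewed <- data.frame(x = rexp(1000, rate = 0.2))

# apply the same Yeo-Johnson transformation used later in our recipes
rec <- recipe(~ x, data = skewed) %>%
  step_YeoJohnson(x) %>%
  prep()

transformed <- bake(rec, new_data = NULL)
# the transformed values sit much more symmetrically around their median
```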
Another set of variables we thought would be important to explore in this data analysis are those relating to the demographics of the respondent, such as age, gender, race, education level, and marital status. Each of these demographic variables seems as if it could provide valuable insight into the income of the participant. For example, income generally increases when a person has more education. There is also usually a correlation between race and income, with white households having higher household incomes than households of color.
We began by looking at the relationship between income and the education level of the participant.
local_arts_train %>%
  select(income, educ, race, gender, age, hhsize) %>%
  ggplot(aes(income, fill = educ)) +
  geom_bar(position = "fill") +
  coord_flip() +
  theme_minimal() +
  labs(
    title = "Distribution of Education Levels across Income Brackets",
    fill = "Highest Education Level",
    y = "prop"
  )

As hypothesized, there appears to be a correlation between education level and income. The proportion of people with bachelor's degrees and graduate degrees increases as income increases.
Since there are multiple levels of educ, it would make sense to visualize the distribution of this predictor.
# distribution of educ
local_arts_train %>%
  mutate(
    educ = educ %>% fct_infreq() %>% fct_rev()
  ) %>%
  ggplot(aes(educ)) +
  geom_bar() +
  coord_flip() +
  labs(
    x = "Highest Education Level"
  ) +
  theme_minimal()

The levels with the highest counts are Some college, High school, and Bachelors degree. A few levels, Some grad. school, Vocational school, Grades k-8, No school, and Other, have very few values and could pose a problem for modeling due to dummy coding. This will be addressed by adding a step_other step to our recipe that combines infrequently occurring levels into a single other level.
The next demographic factor we explored in relation to income is gender.
local_arts_train %>%
  select(income, educ, gender, age, hhsize) %>%
  ggplot(aes(income, fill = gender)) +
  geom_bar(position = "fill") +
  coord_flip() +
  theme_minimal() +
  labs(
    title = "Distribution of Gender across Income Brackets",
    fill = "Gender",
    y = "prop"
  )

As the income bracket increases, the proportion of women in that bracket decreases and the proportion of men increases. This is unsurprising, given the persistence of sexism in the workforce, particularly the gender pay gap and the glass ceiling in many top-tier, high-paying jobs.
Next, we explored the relationship between household income and age.
ggplot(local_arts_train, aes(age)) +
  geom_histogram(binwidth = 5) +
  theme_minimal() +
  facet_wrap(~ income) +
  labs(
    title = "Distribution of Age across Income Brackets",
    x = "Age",
    y = "Count"
  )

As the income bracket increases, the distribution of ages seems to shift slightly upwards; that is, as income bracket increases, so does the average age of participants. This would make sense: younger adults tend to have less money due to their lack of experience, while older adults have more experience in the workforce and may be more financially responsible, leading to higher incomes.
We also explored the relationship between a respondent’s marital status and their income.
local_arts_train %>%
  ggplot(aes(income, fill = marital)) +
  geom_bar(position = "fill") +
  coord_flip() +
  theme_minimal() +
  labs(
    title = "Distribution of Marital Status across Income Brackets",
    y = "prop",
    fill = "Marital Status"
  )

The graph shows that there is very clearly a relationship between a respondent's marital status and their income. There is a higher proportion of married people in the higher income brackets, and a higher proportion of single people in the lower income brackets.
Finally, the last demographic variable we explored was race. First, we looked at the relationship between a participant’s race and their income.
local_arts_train %>%
  ggplot(aes(income, fill = race)) +
  geom_bar(position = "fill") +
  coord_flip() +
  theme_minimal() +
  labs(
    title = "Race by Income",
    fill = "Race",
    y = "prop"
  )

The graph very clearly shows that as the income bracket increases, so does the proportion of white respondents in that bracket. A larger proportion of people of color are represented in the lower income brackets than in the higher income brackets.
We also thought, since there are so many levels of the variable race, that it would make sense to look at the distribution of race itself.
local_arts_train %>%
  mutate(
    race = race %>% fct_infreq() %>% fct_rev()
  ) %>%
  ggplot(aes(race)) +
  geom_bar() +
  coord_flip() +
  theme_minimal()

White folks vastly outnumber folks of other races in this data-set. There is very little diversity in this variable, which will be a problem when dummy encoding. This will be addressed by adding a step_other step to our recipe that combines infrequently occurring levels into a single other level.
There were a few final things we explored in our data-set prior to modeling. Given how many people responded that they had not attended or participated in a given arts event within the past year, we thought it would be interesting to look at some of the reasons why this is the case. Below is the distribution of the variable bar1, which indicates the respondent's first reason for not attending local arts events/not attending them more often.
local_arts_train %>%
  mutate(
    bar1 = bar1 %>% fct_infreq() %>% fct_rev()
  ) %>%
  ggplot(aes(bar1)) +
  geom_bar() +
  coord_flip() +
  labs(
    x = "reason for not attending"
  ) +
  theme_minimal()

As you can see above, the most common reason people gave for not attending local arts events more frequently is that they don't have time, followed by the cost of tickets and lack of childcare. Note also that most of the levels of bar1, except for (02) Don't have time, have very few values. This could be a problem when dummy encoding for modeling, as it will create variables with little to no variance. This problem will be dealt with via a step_zv step in our recipes to remove zero-variance variables. It could also be interesting to see how these reasons for not attending local arts events change (or remain the same) across levels of income.
# distribution of bar1 by income
local_arts_train %>%
  mutate(
    bar1 = bar1 %>% fct_infreq() %>% fct_rev()
  ) %>%
  ggplot(aes(bar1)) +
  geom_bar() +
  facet_wrap(~ income) +
  coord_flip() +
  labs(
    x = "reason for not attending"
  ) +
  theme_minimal()

It seems that across all income levels, lack of time is the primary reason for not attending local arts events more frequently.
Finally, using the caret package, we used nearZeroVar() to identify which variables truly had little to no variance. Such variables fail to provide meaningful insight or to meaningfully affect the model. These variables were caseid, site, abtid, and exchange, which are all meaningless administrative variables, as well as source2, source3, source4, and source5, which record the 2nd, 3rd, 4th, and 5th answers given by respondents as to their source of information about local arts events. All eight of these variables were removed from our initial data-set before splitting.
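A sketch of that screen, assuming caret's default freqCut and uniqueCut thresholds (the exact call in our setup script may differ):

```r
library(caret)

# flag near-zero-variance columns by name and drop them
# (uses caret's default thresholds: freqCut = 95/5, uniqueCut = 10)
nzv_names <- nearZeroVar(local_arts_data, names = TRUE)
local_arts_data <- local_arts_data %>% select(-all_of(nzv_names))
```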
Ultimately, after removing these last eight variables, we decided to utilize all 65 of the remaining non-outcome variables in our data-set as predictors. Although we didn't explore every single variable in our exploratory data analysis, from what we did explore there seems to be at least a mild relationship between most of the possible predictor variables and our outcome, income. The next step after the exploratory data analysis, then, was to choose the types of models we wanted to tune and to build pre-processing recipes to prepare our data for modeling.
After exploring the data, an important next step in the modeling process is preparing the data to be fed to the models. No model can perform particularly well if the data is left in its raw condition, so feature engineering is used to perform various transformations on the data to make it as amenable to the models as possible. This can be done with a recipe using the recipes package from tidymodels. We used the exact same recipe for four of our models: the k-nearest neighbors, neural net, elastic net, and support vector machine. This recipe contains an initial step in which the recipe is established (using the recipe function); this step indicates income as our outcome and all other variables as predictors. Next, the recipe contains an imputation step for missingness using bagged-tree models via step_impute_bag. Then we added step_YeoJohnson for all numeric variables. This transformation helps deal with the skewness found in our exploratory data analysis by making numeric predictors more symmetric. Next, we added two step_other steps, one for educ and one for race, which both had levels with very few values in them. These two steps indicate that, for both educ and race, the bottom 1% of levels should be combined into a single other level. Then, we added step_dummy to dummy-code all categorical variables. Dummy coding means that every categorical variable with n levels becomes n - 1 numerically encoded variables with values of 0 (if the observation does not take that level) or 1 (if it does). We made sure to exclude our outcome variable from this dummy coding step even though it too is categorical: models will not run if the outcome variable is dummy coded, so it needs to remain as it is. Next, we added step_normalize to center and scale all predictors. Centering places the mean of each variable at zero, and scaling standardizes the units of the data so that all variables have a standard deviation of 1.
Scaling is particularly beneficial for nearest neighbors models, which use distance metrics to find the k nearest points to a new observation and thus benefit from all variables being on the same scale. Similarly, elastic net models benefit from all variables being on the same scale, which ensures the penalty used by the model affects every variable in the same way. Finally, step_zv was included to remove any zero-variance predictors after dummy coding. Zero-variance predictors result from levels of a categorical variable that occur so infrequently that they are simply absent from some re-samples; the dummy-encoded variables representing those levels in those re-samples contain only zeros. Such variables hold no predictive information and can cause problems for our models, which is why it is important to remove them. The code for this recipe is printed below.
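To see concretely what centering and scaling do, here is a minimal sketch on toy data (the variable and values are purely illustrative, not from our data-set):

```r
# toy illustration of step_normalize: after prepping and baking,
# the numeric variable has mean 0 and standard deviation 1
library(tidymodels)

toy <- tibble(x = c(2, 4, 6, 8, 10))

toy_normalized <- recipe(~ x, data = toy) %>%
  step_normalize(all_numeric()) %>%
  prep() %>%
  bake(new_data = NULL)

mean(toy_normalized$x) # 0
sd(toy_normalized$x)   # 1
```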
# recipe used for knn, neural net, elastic net and support vector machine
recipe1 <-
  # set outcome variable (income) and predictors (all other variables)
  recipe(income ~ ., data = local_arts_train) %>%
  # impute missing data
  step_impute_bag(all_predictors()) %>%
  # Yeo-Johnson transform numeric variables to deal with skewness
  step_YeoJohnson(all_numeric()) %>%
  # create an other category for infrequently occurring levels
  # of educ and race
  step_other(educ, threshold = 0.01) %>%
  step_other(race, threshold = 0.01) %>%
  # dummy encode categorical predictors
  step_dummy(all_nominal(), -all_outcomes()) %>%
  # normalize all predictors
  step_normalize(all_predictors()) %>%
  # remove zero-variance predictors
  step_zv(all_predictors())

The recipes for the last two models, the tree-based models, were almost exactly the same as the recipe above, except for how we encoded categorical variables. For the boosted tree model, we decided to one-hot encode the categorical variables instead of dummy encoding them. Rather than converting a categorical variable with n levels into n - 1 numeric variables, one-hot encoding converts every level of a categorical variable into its own 0/1 indicator variable, producing n columns. We did this because, unlike linear models, which can suffer from perfect collinearity when all n levels are included, tree-based models can handle having every level of a categorical variable represented as a separate variable, and we thought it would be interesting to try out this kind of encoding to see how well the model performs. That recipe is below.
# recipe used for boosted tree model
bt_recipe <-
  # set outcome variable (income) and predictors (all other variables)
  recipe(income ~ ., data = local_arts_train) %>%
  # impute missing data
  step_impute_bag(all_predictors()) %>%
  # Yeo-Johnson transform numeric variables to deal with skewness
  step_YeoJohnson(all_numeric()) %>%
  # create an other category for infrequently occurring levels
  # of educ and race
  step_other(educ, threshold = 0.01) %>%
  step_other(race, threshold = 0.01) %>%
  # one-hot encode categorical predictors
  step_dummy(all_nominal(), -all_outcomes(), one_hot = TRUE) %>%
  # normalize all predictors
  step_normalize(all_predictors()) %>%
  # remove zero-variance predictors
  step_zv(all_predictors())

Finally, for the random forest model, we did not dummy encode the categorical variables at all. Random forest models are among the model types that can handle categorical variables without converting them to a numeric form. Although leaving categorical variables as factors is not guaranteed to improve the performance of the model, it can help the model run faster, which is why we decided not to dummy code the categorical variables for the random forest. The recipe used for the random forest model is below.
# recipe used for random forest model
rf_recipe <-
  # set outcome variable (income) and predictors (all other variables)
  recipe(income ~ ., data = local_arts_train) %>%
  # impute missing data
  step_impute_bag(all_predictors()) %>%
  # Yeo-Johnson transform numeric variables to deal with skewness
  step_YeoJohnson(all_numeric()) %>%
  # create an other category for infrequently occurring levels
  # of educ and race
  step_other(educ, threshold = 0.01) %>%
  step_other(race, threshold = 0.01) %>%
  # normalize all numeric predictors
  step_normalize(all_numeric()) %>%
  # remove zero-variance predictors
  step_zv(all_predictors())

After tuning our six different models, we evaluated how each model performed across the folds. The metric we decided to use was the area under the receiver operating characteristic curve, or roc-auc. Roc-auc measures how well a model can differentiate between the classes of the outcome variable: a value closer to one means the model discriminates better, while a value of 0.5 means the model performs no better than predicting the outcome values at random.
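As a quick illustration of the metric (toy predictions, not from our models), yardstick's roc_auc() compares predicted class probabilities to the true classes. Our income outcome has more than two classes, so our models actually use the multi-class Hand-Till generalization, but the intuition is the same as in this two-class sketch:

```r
# toy two-class example: "high" observations all receive higher
# predicted probabilities than "low" ones, so separation is perfect
library(tidymodels)

toy_preds <- tibble(
  truth      = factor(c("high", "high", "low", "low")),
  .pred_high = c(0.9, 0.6, 0.4, 0.2)
)

# perfect separation gives an roc-auc of 1;
# random guessing would hover around 0.5
roc_auc(toy_preds, truth = truth, .pred_high)
```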
Below is a table showing the roc-auc value and standard error of the best-performing model of each model type across the folds, arranged in descending order by roc-auc.
# create tibble
best_models <- tribble(
  ~`Model Type`,            ~`Roc-auc`, ~`Standard error`,
  #------------------------|-----------|------------------|
  "Random Forest",          0.712,      0.00441,
  "Elastic Net",            0.684,      0.00264,
  "Boosted Tree",           0.688,      0.00265,
  "Support Vector Machine", 0.687,      0.00306,
  "Neural Network",         0.591,      0.0128,
  "K-Nearest Neighbors",    0.596,      0.00385
)
# present the table
best_models %>%
  # arrange in descending order of roc-auc
  arrange(desc(`Roc-auc`)) %>%
  # apply table styling functions from the kableExtra package
  kbl() %>%
  kable_styling()

| Model Type | Roc-auc | Standard error |
|---|---|---|
| Random Forest | 0.712 | 0.00441 |
| Boosted Tree | 0.688 | 0.00265 |
| Support Vector Machine | 0.687 | 0.00306 |
| Elastic Net | 0.684 | 0.00264 |
| K-Nearest Neighbors | 0.596 | 0.00385 |
| Neural Network | 0.591 | 0.01280 |
As you can see, the random forest model performs the best of all the model types, producing an roc-auc value of 0.712. Furthermore, the standard error of the best random forest does not overlap with the standard error of the best boosted tree, the second-highest-performing model. This gives us strong evidence that the random forest is our best-performing model, and thus the model we will use to predict the outcome values on our test data. Before moving on to finalizing our best random forest workflow and applying it to the test data, however, let’s get a sense of which parameter values produced the best random forest model. Below, we read in the tuned random forest object from its .rda file and then use the autoplot() function to produce a graphical representation of the model’s performance across the different possible parameter values.
# read in rf_tuned
load(file = "output/models/rf_tuned.rda")
# autoplot
# specify roc_auc as the performance measure
autoplot(rf_tuned, metric = "roc_auc")

The autoplot shows that the random forest model consistently produced the best roc-auc values with a minimum node size (min_n) of 40, except for a small overlap with a min_n of 30 at an mtry value of 21. It also shows that roc-auc increased as the number of randomly selected predictors per split, mtry, increased from 10 to around 30, and then dropped off between 30 and 56. Thus, the combination of parameters that produced the best random forest model was a min_n of 40 and an mtry of 33, as you can see in the table below. (Note: it appears as though the top five models performed the same because their roc-auc values are rounded to three decimal places, but the model with an mtry of 33 and a min_n of 40 is at the top.)
# show best performing random forest models
rf_tuned %>%
  show_best(metric = "roc_auc")

## # A tibble: 5 x 8
## mtry min_n .metric .estimator mean n std_err .config
## <int> <int> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 33 40 roc_auc hand_till 0.712 15 0.00441 Preprocessor1_Model23
## 2 33 21 roc_auc hand_till 0.712 15 0.00455 Preprocessor1_Model13
## 3 44 40 roc_auc hand_till 0.712 15 0.00445 Preprocessor1_Model24
## 4 21 30 roc_auc hand_till 0.712 15 0.00437 Preprocessor1_Model17
## 5 21 40 roc_auc hand_till 0.712 15 0.00430 Preprocessor1_Model22
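With the best parameters identified, the next step is to finalize the workflow and fit it to the full training set. A sketch of that step is below; the object names rf_workflow and local_arts_test are assumptions based on the earlier steps and may differ from our actual scripts:

```r
# select the winning parameter combination (mtry = 33, min_n = 40)
best_params <- select_best(rf_tuned, metric = "roc_auc")

# plug those parameters into the workflow (rf_workflow is assumed
# to be the tuned workflow object) and fit on all training data
rf_final_fit <- rf_workflow %>%
  finalize_workflow(best_params) %>%
  fit(data = local_arts_train)

# predict class probabilities for income on the held-out test set
predict(rf_final_fit, new_data = local_arts_test, type = "prob")
```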